Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

I have set the options include=FALSE and message=FALSE to avoid cluttering the solutions with all the output from this code.

Collecting Data

Overview

Its a well-known aphorism within data science that the majority of an analysis consists in cleaning and validating our data. If we can collect our data in a clean format from the start, it will allow us to proceed directly to the exploration stage once the data have been collected.

There are number of excellent articles that give an extensive overview of how to collect and organize data. Hadley Wickham’s “Tidy Data”, one of the most citing papers across all of data science, offers an extensive theoretical framework for describing a process for collecting data sets. Karl Broman and Kara Woo’s “Data Organization in Spreadsheets”. offers a balanced between practical advice and an extended discussion of general principles for collecting data sets. The Data Carpentry guide Data Organization in Spreadsheets provides a concise list of common pitfalls.

This notebook provides a summarized set advice for organizing and storing data within a spreadsheet program. Rather than an extensive discussion of various pros and cons, it primarily focuses on the explicit approaches that I recommend. For readers interested in a broader coverage, I suggest reading the articles cited above. Because we are not using any fancy spreadsheet functions here, any program that you would like to use should be fine. The screenshots come from Microsoft Excel, but the same approach will work in Google Sheets, LibreOffice, or another spreadsheet program of your choosing.

Rectangular data

We have frequently discussed the concept of a rectangular data set, with observations in rows and variables in columns. This is the same format that we will use to collect our data. The first thing you will need to do, then, is determine what things you are observing and what properties you would like to collect about each thing. If you are observing different different kinds of things, each of which has a different set of associated properties, you may need to store each set in a different table.

To match the format of rectangular data that we have been working with in R, we need to structure our data set with a single row of column names, followed by a row of data for each observation. Here is an example from Excel of a nonsense data set with three variables:

Notice that we need to always start in the first cell, A1, and fill in a consistent number of rows and columns. We do not have multiple tables scattered around the spreadsheet. We do not have multiple header columns. It is just the data itself, stored in an contained rectangle in the upper-left hand corner of our spreadsheet. It is okay to include a small amount of formatting of the cells to help with the data-entry process (I like the make the first row bold), but do not try to record measurable properties of your data with formatting.

Naming things

It is important to choose good variable names. As we have seen, the variable names in a data set are used to describe the graphics in the grammar of graphics and for manipulating data with verbs. If our names are too complex or difficult to remember, it will be more difficult to create data visualizations. When variables contain spaces or other special characters, it can become nearly impossible to work within R without first cleaning up the variable names after loading the data.

I find the best approach to variable names is to only use lower-case letters, numbers, and underscores. The underscores can be used in place of spaces, but do not make variable names more complex than needed. Also, make sure to start the name of a variable with a lower-case letter (starting with a number is invalid in R and many other programming languages). Note that I recommend using lowercase even for variable names that should be capitalized, such as acronyms and proper nouns. If you selectively capitalize, you will always need to remember where this was done. In my scheme, it is one less thing to remember.

Throughout the rest of this notebook, continuing with the food theme, we will show examples of a small data set collecting information about a set of cookbooks. Here is the table before filling in any information, with just the row names.

For the specific variables within a data set, spaces, capital letters, and other special characters are fine. Just be consistent. Do not use “female” in one observations, “Female” in another, and “F” in a third. Just pick the format and stick to it. Where applicable, try to use standards-based labels (ISO country codes or U.S. Postal service state abbreviations). This will help if you later want to merge your data set with other sources.

One thing in each cell

Within each cell, there should only be one piece of information. In particular, this means that cells should not contain units or current symbols. If you have data collected in different units, create a new column and put the units there. Though, it is best to store everything on the same scale. If there is something to note about a particular value, do not put a star or other mark with it. Create a new column. This also means that, as mentioned above, you should not try to store two things in one cell by using formatting to indicate a secondary piece of information. Again, if you have two things to indicate, create a new column. Here is an example of our cookbook data set with these principles applied:

If you need to include explanatory notes for some of the data, which is often a great idea, do not abandon the rectangular data format. Instead, include an extra column of notes with its own name. For example, here we explain that one of our books is out of print:

In our example table, the number of pages and weight of one book is missing because it is out of print. In order to indicate this, the corresponding cell is blank. Blank values are the only cross-software way to indicate missing values in a consistent way.

Dates

Date variables are a well-known source of error in the collecting and recording of data. My recommendation for the most flexible and least error-prone method is to simply record each component of a date as its own column. This means, one column for year, one for month, and one for day. It will be much easier to work with if you keep months as numbers rather than names. If you have time information as well, this can be recorded by putting the hours, minutes, and seconds as their own columns.

One benefit of this method is that it will be easy to record historical data in cases where you may not be sure of the month or day for every row of the data set. For example, here is a data set showing properties of the cookbook authors from our data set:

There is an ISO standard (ISO 8601) for representing dates and times in a consistent way: YYYY-MM-DD. This is often a great format for storing data, and one that is used in several example data sets for this book, but (1) can lead to errors when opening and re-saving in a spreadsheet program and (2) cannot easily store dates with unknown information. I suggest using the separate column approach while collecting your initial data set. Later, if you re-save a modified data set from within R, the ISO 8601 format is a good option.

Output format

Most sources on collecting data suggest storing your results in a plaintext format. This is a stripped down representation of the data that contains no formatting information and is application agnostic. Excel, GoogleSheets, LibreOffice, and any other spreadsheet program you come across, should be able to save your data set in a plaintext format. The most commonly used type is called a comma seperated value (or “csv”) file. Here, columns are split by commas and each row is on its own line. Here is what the csv file of our cookbook authors data set looks like:

variable,long_name,units,description
book_title,Book Title,,"Book name, given in English"
author,Author's name,,Authors given and family name
pages,Number of pages,,Pages in currently available edition
weight,Book weight,grams,Weight of currently available edition
nationality,Author's Nationality,,"Using ISO two letter country codes (i.e., CA)"
birth_year,Author's birth year,,As numeric variable
birth_month,Author's birth month,,As numeric variable
birth_day,Author's birth day,,As numeric variable

Nearly all of the data set provided with this book are stored as CSV files. One thing to be careful of, particularly when using Excel, is that if your computer is configured for a language that uses a comma as a decimal separator, the default csv output may actually use a semicolon (;) in place of a comma. To read these files in R, just replace the read_csv with the function read_csv2.

Unlike some other sources, I am less strict about the need to only export data as a plaintext file. This is the best way for sharing and storing a data set once an analysis is finished, but if you are going to continue adding and changing your data set, it may actually be preferable to store your data in an .xlsx file (it avoids errors that are introduced when converting back and forth between excel and plaintext formats). Data can be loaded directly from an excel file with the readxl package. Here is the syntax for using the package to read in a data set, with either the first sheet or a named sheet:

library(readxl)
data <- read_excel("authors.xlsx")                      # load the first sheet
data <- read_excel("authors.xlsx", sheet = "sheetname") # load sheet by name

Once you are finished with the data collection and cleaning processing, then it is a good idea to store your data in a plaintext format for sharing and long-term preservation.

Data dictionary

So far, our discussion has focused on the specifics of storing the data itself. It is also important to document exactly what information is being stored in your variables. To do this, we can construct a data dictionary. This should explain, for each variable, basic information such as the variable’s name, measurement units, and expected categorical values. Any decisions that needed to be made should also be documented. A data dictionary can be a simple text file, or can be stored itself as a structured data set. Here is an example of a data dictionary for our authors data set:

We included a long name for each variable, which will be useful when creating improved graphics labels when preparing data visualizations for publication.

Summary

There is a lot of information passed along within this notebook. For future reference, here are the key formatting guidelines for storing data sets in spreadsheets:

  • record a single row of column names, starting in cell A1, followed by rows of observations
  • only used lowercase letters, numbers, and underscores in column names; always start the name with a letter, and keep the variable names relatively short
  • when recording categorical variables, keep the codes consistent
  • only one thing in each cell; no units, currency symbols, or notes
  • keep a separate notes column if needed
  • blank cells within the rectangle are used if and only if the data is missing
  • save dates by storing year, month, day, hour, ect. as their own columns
  • save results as an xlsx file while in the middle of data collection; save as a csv file for long-term storage and sharing
  • create a data dictionary to record important information about your data set

As mentioned in the introduction, this is an opinionated list and some other options are equally valid. The most important thing, however, is consistency, so in this book I will expect that you follow the formatting advice here when collecting data.

Practice

Movies data

For this notebook, we will be creating two datasets, reading them into R, and producing several data visualizations.

For the first task, we will create a dataset as a class describing a set of movies. Each student will be tasked with creating three rows of data. We will record the following items in the GoogleSheets file linked to on the class website:

  • student name
  • title
  • date of release (year, month, and day)
  • budget
  • box-office numbers
  • genre
  • personal rating of the film, from 1-5 stars

Once you are finished, save the data set as a csv file, upload it to the “data” directory in RStudio, and read it into R in the following block of code:

movies <- read_csv(file.path("data", ".csv"))
movies

Then, produce a scatterplot below with the budget on the x-axis, box-office numbers on the y-axis, points colored by the year of release, and the size of the points corresponding to your personal rating. Use a text repel layer to label the movies by their title. Include axis labels and titles, along with units of measurement.

If you think it is needed, also include logarithmic scales for the x- and y-axes.

Friends data

For the second data set, you are going to collect information about ten of your friends on your own. You can do this in any software you would like. At a minimum, record the following variables:

  • their name
  • where you met them (make your own categories, such as “university”, “high school”, “camp”)
  • the year you met the friend
  • personal rating of the friend, from 1-5 stars

Once you are finished, save the data set as a csv file and read it into R in the following block of code:

friends <- read_csv(file.path("data", ".csv"))
friends

Then, produce a scatter plot below using any two variables that you find interesting as the x- and y-axes. Include a highlight layer showing a subset of your friends. These could be friends you met at a certain time in your life, highlighting your best friend(s) or something else that is interesting to you. Include axis labels and titles, along with units of measurement.